Abstract

We propose a novel reformulation of the stochastic optimal control problem as an approximate inference
problem, demonstrating that such an interpretation leads to new practical methods for the original
problem. In particular we characterise a novel class of iterative solutions to the stochastic optimal
control problem based on a natural relaxation of the exact dual formulation. These theoretical insights are
applied to the Reinforcement Learning problem, where they lead to new model free, off policy methods
for discrete and continuous problems.

1 Introduction



In recent years the framework of stochastic optimal control (SOC) [27] has found increasing applicability in
the domain of planning and control of realistic robotic systems [17] [35], while also finding widespread use
as one of the most successful normative models of human motion control [32l IS]. In general SOC can be
summarised as the problem of controlling a stochastic system so as to minimise expected cost. The general
problem subsumes a variety of different problems all based on slightly different assumptions, e.g. Markov
Decision Processes [35], Reinforcement Learning [55] or Adaptive Control. The increased use of the general
formalism in high dimensional and nonlinear settings necessitates the development of novel efficient methods,
while its diverse nature makes novel theoretical insights into the general problem extremely desirable.

In the most general setting, the stochastic optimal control problem with arbitrary dynamics and cost
function is analytically intractable and significant previous research has focused on developing efficient
approximate solution methods [10] [15]. In particular there have been, in recent years, an increasing number of
attempts to relate the stochastic optimal control problem to problems from the domain of probabilistic
inference, specifically maximum likelihood problems, e.g., [HIE], and inference problems [12] [33]. The hope was
that by finding such correspondences, the large number of available efficient Machine Learning [4] approaches
would become applicable to the stochastic optimal control problem.

In this paper we propose a reformulation of the general stochastic optimal control problem as a problem
of approximate probabilistic inference. Unlike previous theoretical work on this issue [13] [31] [15] [THJ this
reformulation is exact without making further assumptions, though this comes at the cost of a lack of a closed
form solution. However, the exact reformulation of stochastic optimal control as an inference problem is, in
itself, not the main motivation of this work. Rather we see it as a starting point for the development of novel
approaches to the problem, which draw from the alternative interpretation. We show for example that the
reformulation can be directly related to the previously proposed approximate inference control framework
[33], which allows us to clarify the relation of the latter to stochastic optimal control.

Importantly we demonstrate that a relaxation of the new formulation, which is natural in the context of a
probabilistic interpretation, directly leads to a novel class of iterative solutions for the stochastic optimal
control problem. We characterise the form of these iterations and highlight their relation to previous
applications of the Expectation Maximisation algorithm in this area [3H [5]. We also directly demonstrate the
applicability of these results by deriving novel model free, off policy Reinforcement Learning algorithms for
discrete and continuous problems.

We would like to note that this text forms part of the first author's PhD progress report (May 2010)
submitted to the University of Edinburgh Graduate School. This document aims to make this work available
to a wider audience as we have been made aware of the recent work by Kappen et al. [1] (pursued
independently and in parallel) which shows distinct parallels to the methods developed here, with specifically
the Dynamic Policy Programming (DPP) algorithm having significant overlap with the LS'I' algorithm proposed
here, although the motivation and derivation differ. Furthermore we claim that the results presented
here go beyond the work of [1] by providing a more general framework, relating it to previous approaches
in Stochastic Optimal Control and Reinforcement Learning, and by demonstrating applicability of the
algorithm to continuous problems. In particular we highlight a class of approximations which lead to analytical
expressions in the continuous setting, mitigating the need to use the computationally expensive numerical or
Monte Carlo methods anticipated by [1].

The remainder of this paper is structured as follows. After introducing necessary concepts of stochastic
optimal control in section 2 we present in section 3 our theoretical results relating to the approximate
inference formulation of stochastic optimal control problems. These are then applied in section 4 to the
Reinforcement Learning problem.

2 Preliminaries 

In the remainder of this text we will consider control problems which can be modeled by a Markov Decision
Process (MDP) and before proceeding we first recall the standard formalism. We shall keep this exposition
rather brief, only introducing concepts necessary for the development of the theory and methods in this
paper. For a broader review one may refer to the 1st Year proposal or [29], or for a more thorough treatment
to any of the numerous text books on the subject, e.g., [27] [28] [3].

An MDP provides in general a model of a sequential decision process, where an agent observes its state,
chooses a control and then transitions to a new state whilst incurring a certain cost. More formally, let
$x_t \in \mathcal{X}$ be the state and $u_t \in \mathcal{U}$ the control signals at times $t = 1, 2, \ldots, T$.
In order to simplify notation we will denote whole state and control trajectories
$x_{1 \ldots T}, u_{0 \ldots T}$ by $\bar{x}, \bar{u}$. Let $P(x_{t+1}|x_t, u_t)$ be the transition
probability for moving from $x_t$ to $x_{t+1}$ under control $u_t$ and let $C_t(x, u) \ge 0$ be the cost
incurred for choosing control $u$ in state $x$ at time $t$. A policy for time step $t$, $\pi_t(u_t|x_t)$, is the
conditional probability of choosing the control $u_t$ given the state $x_t$. In the interest of a less cluttered
notation we shall in the following in general drop the subscript $t$ on $\pi$ if it is obvious from the context.
An important family of policies is the set $\mathcal{D}$ of deterministic policies, which are policies given by a
conditional delta distribution, i.e. $\pi(u_t|x_t) = \delta_{u_t = \tau(x_t)}$ for some function $\tau$. The
stochastic optimal control problem consists of finding a deterministic policy¹ which minimises the expected
cost, i.e., solving

$$ \pi^* = \operatorname{argmin}_{\pi \in \mathcal{D}} \left\langle \sum_{t=0}^{T} C_t(x_t, u_t) \right\rangle_{q_\pi} , \qquad (1) $$

where

$$ q_\pi(\bar{x}, \bar{u} | x_0) = \pi(u_0|x_0) \prod_{t=1}^{T} \pi(u_t|x_t) P(x_t|x_{t-1}, u_{t-1}) , \qquad (2) $$

is the distribution over trajectories with start state $x_0$ and under policy $\pi$.

In the case of an infinite time horizon, i.e. for $T \to \infty$, we will restrict ourselves to the discounted
cost formulation. That is, we will assume the cost to be a discounted time stationary cost, so that
$C_t(x_t, u_t) = \gamma^t C(x_t, u_t)$ for some discount factor $\gamma \in [0, 1]$.

For a given policy $\pi$ we may define the value function $J_t^\pi : \mathcal{X} \to \mathbb{R}$ as the mapping
from a state $x$ to the expected cost of starting in $x$ at time $t$ and following $\pi$ thereafter, i.e.,

$$ J_t^\pi(x) = \left\langle \sum_{k=t}^{T} C_k(x_k, u_k) \right\rangle_{q_\pi(x_{t+1 \ldots T}, u_{t \ldots T} | x_t = x)} . \qquad (3) $$

Similarly we may, for a given policy $\pi$, define the state-control value function (more commonly known as the
state-action value function) $Q_t^\pi : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$, which for a given $x, u$
gives the expected cost of starting in state $x$ at time $t$, choosing control $u$ and following $\pi$
thereafter, i.e.,

$$ Q_t^\pi(x, u) = \left\langle \sum_{k=t}^{T} C_k(x_k, u_k) \right\rangle_{q_\pi(x_{t+1 \ldots T}, u_{t+1 \ldots T} | x_t = x, u_t = u)} . \qquad (4) $$

Of obvious interest are the value and state-action value functions of $\pi^*$, which we denote by $J^*, Q^*$.
They are sufficient, in the case of $J^*$ together with knowledge of the transition probability and cost
function, to characterise the optimal policy. An equation of particular importance in this context is the
Bellman optimality equation

$$ J_t^*(x_t) = \min_{u_t \in \mathcal{U}} \left[ C_t(x_t, u_t) + \int_{x_{t+1}} P(x_{t+1}|x_t, u_t) J_{t+1}^*(x_{t+1}) \right] , \qquad (5) $$



which in the infinite horizon discounted cost setting gives the following fixed point equation for the optimal 
value function, 



$$ J^*(x) = \min_{u \in \mathcal{U}} \left[ C(x, u) + \gamma \int_{y} P(y|x, u) J^*(y) \right] . \qquad (6) $$



¹ n.b. it can be shown that for problems of the type described here there exists an optimal policy which is deterministic [29].

Although the Bellman equations are in general not analytically tractable, in either the finite or infinite horizon
case, they have provided the starting point for a large number of approaches for solving the stochastic optimal 
control problem, and will indeed be closely related to the starting point of the formulation proposed in this 
paper. 
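
To make the fixed point equation (6) concrete, the following is a minimal sketch of tabular value iteration,
assuming finite state and control sets so that the integral reduces to a sum. The function name
`value_iteration`, the nested-list layout `P[x][u][y]` and the two-state example problem are illustrative
assumptions and do not appear in the original text.

```python
def value_iteration(P, C, gamma, tol=1e-10, max_iter=10_000):
    """Iterate J(x) <- min_u [C(x,u) + gamma * sum_y P(y|x,u) J(y)] until convergence."""
    n_states = len(C)
    n_controls = len(C[0])
    J = [0.0] * n_states
    for _ in range(max_iter):
        J_new = [
            min(
                C[x][u] + gamma * sum(P[x][u][y] * J[y] for y in range(n_states))
                for u in range(n_controls)
            )
            for x in range(n_states)
        ]
        # Stop once the largest change over all states is below the tolerance.
        if max(abs(a - b) for a, b in zip(J, J_new)) < tol:
            return J_new
        J = J_new
    return J


# Hypothetical two-state, two-control problem: P[x][u][y] = P(y | x, u).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
C = [[1.0, 2.0],
     [0.5, 0.0]]
gamma = 0.9

J_star = value_iteration(P, C, gamma)
```

The returned list approximates $J^*$ on the toy problem; the discount factor below one is what makes the
backup a contraction, so the loop terminates for any non-negative cost table.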

As an aside we note that throughout this paper we will in general be working under the more general 
assumption of infinite control and state spaces and hence use integrals, as has been already done in the 
Bellman equations. This is done with the understanding that for discrete problems these simply reduce to 
finite sums. 



3 On Stochastic Optimal Control and KL divergences 

We will now state our main theoretical results which will form the basis of the work presented in section 4
and proposed future work. Specifically we will show how stochastic optimal control can be formulated
as an approximate inference problem in a certain probabilistic model. For the purpose of this paper, we
define approximate inference as the approximation of a true posterior within some family of distributions
by minimization of some divergence measure. The divergence measure which we will consider here is the
Kullback-Leibler divergence, which, for two distributions $q$ and $p$ over $\mathcal{X}$, is defined as

$$ KL(q \| p) = \int_{x} q(x) \log \frac{q(x)}{p(x)} . \qquad (7) $$

After introducing the probabilistic model in subsection 3.1 we will, in subsection 3.2, state and discuss our
general duality result. As this result does not directly lead to a closed form solution of the stochastic optimal
control problem we will then proceed to demonstrate in subsection 3.3 that under a relaxation of the exact
dual a novel class of iterative approaches arises, which allows for closed form iterations. We will then derive
such iterations for the finite and infinite horizon case. Finally we will discuss the relations of these results
to previous work in the field.



3.1 Bayesian Model of Control Problems 

In most general terms, we would define inference based control in terms of a Dynamic Bayesian Network 
which includes multiple state, task, and control variables in each time slice. We would distinguish three types 
of random variables, state and control variables, defined as in the stochastic optimal control framework, and 
additionally a set of variables, which we will refer to as task variables, which capture the achievement of 
the objective described by the cost. In general the states and controls are latent variables and we wish to 
marginalise the states and infer the controls. The task variables on the other hand are observed, in the sense 
that we aim to make inference about the controls in the case of an achieved task. 

More formally the model takes the form illustrated by the graphical model in Figure 1. We relate
the task likelihood to the classical cost by choosing

$$ P(r_t = 1 | x_t, u_t) = \exp\{-C_t(x_t, u_t)\} , \qquad (8) $$

which is well defined due to the restriction $C_t(\cdot, \cdot) \ge 0$. The complete joint is now given by

$$ P(\bar{x}, \bar{u}, \bar{r} | x_0; \pi) = q_\pi(\bar{x}, \bar{u}) \prod_{t=0}^{T} P(r_t | u_t, x_t) , \qquad (9) $$

where $q_\pi$, the trajectory distribution under a policy, has been defined previously in (2). As indicated, our
main interest will be with the posterior under the assumed observation of task success, and we will use the
notation

$$ p_\pi(\bar{x}, \bar{u} | x_0) = P(\bar{x}, \bar{u} | x_0, \bar{r} = 1; \pi) = \frac{1}{P(\bar{r} = 1 | x_0; \pi)} q_\pi(\bar{x}, \bar{u}) \prod_{t=0}^{T} P(r_t = 1 | u_t, x_t) . \qquad (10) $$

Figure 1: The graphical model for the Bayesian formulation of the control problem in the finite horizon
case. In the infinite horizon case we obtain a stochastic Markov process.

We would like to note that we have presented a model of sufficient structure for the results which follow,
with the understanding that in cases with additional structure, better algorithms and stronger results may be
obtainable. In particular we make no further assumption about conditional independence structure which is
often present between subsets of state and control variables. Furthermore we view a single task variable in
each time step as a sufficient representative of any set of task variables one might conceive, e.g., end effector
target variables, collisions etc. Eventually, the observed task variables only induce extra potentials on the
remaining state and control variables. Therefore one could avoid introducing them at all in the formalism.
However, we find their notion helpful to develop the theory.

3.2 General Duality 

We may now directly state the main result relating the discussed Bayesian model to stochastic optimal 
control. 

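Proposition 1 (Rawlik & Toussaint [22]). Let $\pi^0$ be an arbitrary stochastic policy and $\mathcal{D}$ the set
of deterministic policies, then the problem

$$ \operatorname{argmin}_{\pi \in \mathcal{D}} KL(q_\pi \| p_{\pi^0}) \qquad (11) $$

is equivalent to the stochastic optimal control problem with cost per stage

$$ \hat{C}_t(x_t, u_t) = C_t(x_t, u_t) - \log \pi^0(u_t|x_t) . \qquad (12) $$

Proof. Let $\pi_t(u_t|x_t) = \delta_{u_t = \tau_t(x_t)}$ for some function $\tau$, then

$$ KL(q_\pi \| p_{\pi^0}) = \log P(\bar{r} = 1 | x_0; \pi^0) + \int_{\bar{x}} \int_{\bar{u}} q_\pi(\bar{x}, \bar{u}) \log \frac{q_\pi(\bar{x}, \bar{u})}{q_{\pi^0}(\bar{x}, \bar{u}) \prod_{t=0}^{T} \exp\{-C_t(x_t, u_t)\}} \qquad (13) $$

$$ = \log P(\bar{r} = 1 | x_0; \pi^0) + KL(q_\pi(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) + \int_{\bar{x}} \int_{\bar{u}} q_\pi(\bar{x}) \delta_{\bar{u} = \tau(\bar{x})} \sum_{t=0}^{T} C_t(x_t, u_t) \qquad (14) $$

$$ = \log P(\bar{r} = 1 | x_0; \pi^0) + KL(q_\pi(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) + \int_{\bar{x}} q_\pi(\bar{x}) \sum_{t=0}^{T} C_t(x_t, \tau_t(x_t)) . \qquad (15) $$

Furthermore the divergence between the controlled process, $q_\pi$, and prior process, $q_{\pi^0}$, is

$$ KL(q_\pi(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) = \int_{\bar{x}} \int_{\bar{u}} q_\pi(\bar{x}, \bar{u}) \sum_{t=0}^{T} \log \frac{\pi(u_t|x_t)}{\pi^0(u_t|x_t)} \qquad (16) $$

$$ = - \int_{\bar{x}} q_\pi(\bar{x}) \sum_{t=0}^{T} \log \pi^0(\tau_t(x_t)|x_t) . \qquad (17) $$

Hence,

$$ KL(q_\pi \| p_{\pi^0}) = \log P(\bar{r} = 1 | x_0; \pi^0) + \left\langle \sum_{t=0}^{T} \left[ C_t(x_t, \tau_t(x_t)) - \log \pi^0(\tau_t(x_t)|x_t) \right] \right\rangle_{q_\pi} , \qquad (18) $$

and as $\log P(\bar{r} = 1 | x_0; \pi^0)$ is constant w.r.t. $\pi$, the result follows. □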

As an immediate consequence we obtain the direct equivalent of a given stochastic optimal control
problem:

Corollary. With $\pi^0(\cdot|x) = U(\cdot)$, where $U(\cdot)$ is the uniform distribution over $\mathcal{U}$, the
problem in (11) is equivalent to the stochastic optimal control problem.

One should note, that in general the result requires the set of controls to be such as to allow a uniform 
distribution to be defined, i.e., either finite or bounded. This is however merely a theoretical consideration, 
and although we will formally limit ourselves to cases where it is satisfied, it is of little practical consequence. 
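
As a sanity check of Proposition 1 and its corollary, the sketch below considers a single decision step with
three controls, where a deterministic policy is simply a choice of control and the KL divergence to the
posterior reduces to a negative log posterior probability. The cost values, the priors and all function names
are hypothetical and chosen only for illustration.

```python
import math


def kl_delta_vs_posterior(tau, prior, cost):
    """KL(q_pi || p_pi0) for a one-step problem where q_pi is a delta on control tau."""
    weights = [p * math.exp(-c) for p, c in zip(prior, cost)]
    evidence = sum(weights)                    # plays the role of P(r = 1 | x0; pi0)
    posterior = [w / evidence for w in weights]
    return -math.log(posterior[tau])           # KL of a delta distribution against p


cost = [1.0, 0.4, 0.6]                         # hypothetical C(x0, u) for u = 0, 1, 2
uniform = [1.0 / 3.0] * 3                      # uniform prior policy (corollary)
skewed = [0.05, 0.05, 0.90]                    # hypothetical non-uniform prior policy


def best_by_kl(prior):
    """Control selected by minimising the KL divergence of problem (11)."""
    return min(range(len(cost)), key=lambda u: kl_delta_vs_posterior(u, prior, cost))


def best_by_modified_cost(prior):
    """Control selected by minimising the modified cost C - log pi0 of (12)."""
    return min(range(len(cost)), key=lambda u: cost[u] - math.log(prior[u]))
```

With the uniform prior both selections coincide with the minimiser of the plain cost, while under the skewed
prior the selected control shifts towards the one favoured by $\pi^0$, in line with the extra
$-\log \pi^0$ term.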

In general the presented reformulation of the stochastic optimal control problem will remain as intractable
as the original formulation. In particular we can see that under the conditions of the corollary the KL
divergence reduces directly to the Bellman equation (5) plus a constant. This is a consequence of the fact
that minimizing $KL(q \| p)$ whilst restricting $q$ to be a delta distribution is equivalent to finding the maximum
of $p$. Despite this intractability in the general case we argue that the presented formulation constitutes an
interesting starting point for novel approaches to the problem: both exact, iterative ones, as illustrated
in the following section, and approximate ones, as is the case with the Approximate Inference Control
framework of Toussaint [33, 22], to which we relate this result in subsubsection 3.4.1.



3.3 Iterative Solution 



From the Bayesian point of view the restriction to delta distributions in Proposition 1 seems rather unnatural
and can, as mentioned previously, be seen as the main cause why the KL divergence remains intractable. A
relaxation of this restriction, i.e. minimising w.r.t. an arbitrary distribution $\pi(\cdot|x_t)$, makes, as we
shall show, the minimization tractable, and although it obviously does not lead directly to an optimal policy, we
have the following result.

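Proposition 2. For any $\pi \ne \pi^0$, $KL(q_\pi \| p_{\pi^0}) \le KL(q_{\pi^0} \| p_{\pi^0})$ implies
$\langle C(\bar{x}, \bar{u}) \rangle_{q_\pi} < \langle C(\bar{x}, \bar{u}) \rangle_{q_{\pi^0}}$.

Proof. Expanding the KL divergences we have

$$ KL(q_\pi(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) - \langle \log P(\bar{r} = 1 | \bar{x}, \bar{u}) \rangle_{q_\pi} + \log P(\bar{r} = 1 | x_0; \pi^0) $$
$$ \le KL(q_{\pi^0}(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) - \langle \log P(\bar{r} = 1 | \bar{x}, \bar{u}) \rangle_{q_{\pi^0}} + \log P(\bar{r} = 1 | x_0; \pi^0) . \qquad (19) $$

Subtracting $\log P(\bar{r} = 1 | x_0; \pi^0)$ on both sides and noting that
$KL(q_{\pi^0}(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) = 0$, we obtain

$$ KL(q_\pi(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) - \langle \log P(\bar{r} = 1 | \bar{x}, \bar{u}) \rangle_{q_\pi} \le - \langle \log P(\bar{r} = 1 | \bar{x}, \bar{u}) \rangle_{q_{\pi^0}} . \qquad (20) $$

Hence, as $KL(q_\pi(\bar{x}, \bar{u}) \| q_{\pi^0}(\bar{x}, \bar{u})) \ge 0$ with equality iff $\pi = \pi^0$, and
$-\log P(\bar{r} = 1 | \bar{x}, \bar{u}) = C(\bar{x}, \bar{u})$ by (8), the result follows. □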
As an immediate consequence, with some initial $\pi^0$, the iteration

$$ \pi^{i+1} \leftarrow \operatorname{argmin}_{\pi} KL(q_\pi \| p_{\pi^i}) , \qquad (21) $$

with $\pi$ an arbitrary² conditional distribution over $u$, gives rise to a chain of stochastic policies with ever
decreasing expected costs. However we note that Proposition 2 has rather weak conditions and we can
generalise the iteration as follows³.

² n.b. formally certain assumptions have to be made to ensure the support of $q_\pi$ is a subset of the support of $p_{\pi^i}$
³ n.b. a more general formulation is possible, which does not require $\pi^i \in \mathcal{P}^i$, however the presented formulation suffices for
our purpose



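Proposition 3. Let $\mathcal{P}$ be the set of all (stochastic) policies. If $\mathcal{P}^i \subseteq \mathcal{P}$
s.t. $\pi^i \in \mathcal{P}^i$ for all $i$, then the policies in the sequence generated by

$$ \pi^{i+1} \leftarrow \operatorname{argmin}_{\pi \in \mathcal{P}^i} KL(q_\pi \| p_{\pi^i}) \qquad (22) $$

have non increasing expected costs.

Proof. As $\pi^i \in \mathcal{P}^i$, $KL(q_{\pi^{i+1}} \| p_{\pi^i}) \le KL(q_{\pi^i} \| p_{\pi^i})$ and hence either
Proposition 2 applies or $\pi^{i+1} = \pi^i$. □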



Note that this formulation admits (21) as a special case. A further interesting case is what we will refer
to as asynchronous updates. These are updates of only one time step at each iteration in any particular
order, i.e. choose a schedule of time steps $t^0, t^1, \ldots$ and let
$\mathcal{P}^i = \{\pi \in \mathcal{P} : \forall t \ne t^i, \pi_t = \pi_t^i\}$.

Naturally questions about the behaviour in the limit of iterations covered by Proposition 3 arise. As
the expected cost is, under the assumption $C_t(\cdot) \ge 0$ (cf. section 2), bounded from below, we have as an
immediate consequence



Corollary. Any iteration of the form (22) converges. 

This obviously leaves open the more interesting question of whether, under what conditions and in what sense
the policy converges to an optimal policy. Although it would certainly be desirable to obtain a general answer
to this question, currently we are concentrating on the specific cases of (21) and asynchronous updates, for
which we suggest that under weak conditions convergence to an optimal policy occurs (see Conjecture 5
below).

We will now proceed by first deriving specific updates for the finite horizon case, subsequently extending 
these to the infinite horizon, discounted cost setting. 



3.3.1 Finite Horizon Case 



As indicated previously the general minimization of iteration (21) can be performed analytically and here we
provide the required derivation for the finite horizon case. To obtain the solution we bring the KL divergence
into the recursive form

$$ KL(q_{\pi^{i+1}}(\bar{x}, \bar{u}) \| p_{\pi^i}(\bar{x}, \bar{u})) = \int_{u_0} \pi^{i+1}(u_0|x_0) \left[ \log \frac{\pi^{i+1}(u_0|x_0)}{\pi^i(u_0|x_0) P(r_0 = 1|x_0, u_0)} + \int_{x} P(x|x_0, u_0) KL(q_{\pi^{i+1}}(x_{2:T}, u_{1:T} | x_1 = x) \| p_{\pi^i}(x_{2:T}, u_{1:T} | x_1 = x)) \right] \qquad (23) $$

and utilizing the following general result



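Lemma 4. Let $a, b, c$ be random variables with joint $P(a, b, c) = P(a) P(b|a) P(c|b, a)$ and $\mathcal{P}$ the
set of distributions over $a$, then

$$ P(a) \exp\left\{ \int_{b} P(b|a) \log P(c = \hat{c}|b) \right\} \propto \operatorname{argmin}_{q \in \mathcal{P}} KL(q(a) P(b|a) \| P(a, b | c = \hat{c})) \qquad (24) $$

$$ -\log \int_{a} P(a) \exp\left\{ \int_{b} P(b|a) \log P(c = \hat{c}|b) \right\} + \log P(c = \hat{c}) = \min_{q \in \mathcal{P}} KL(q(a) P(b|a) \| P(a, b | c = \hat{c})) \qquad (25) $$

Proof. See Appendix A. □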
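
The form of the minimiser in (24) can be checked numerically on a small discrete model. The sketch below
assumes a three-valued $a$, a binary $b$ and a likelihood that depends on $b$ only; all probability tables and
names are hypothetical values chosen for illustration.

```python
import math

# Hypothetical discrete model: a in {0, 1, 2}, b in {0, 1}.
P_a = [0.5, 0.3, 0.2]                                   # P(a)
P_b_given_a = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]      # P(b | a)
lik = [0.7, 0.1]                                        # P(c = c_hat | b)

pairs = [(a, b) for a in range(3) for b in range(2)]
# Marginal likelihood P(c = c_hat), needed to normalise the posterior.
evidence = sum(P_a[a] * P_b_given_a[a][b] * lik[b] for a, b in pairs)


def kl_to_posterior(q):
    """KL( q(a) P(b|a) || P(a, b | c = c_hat) ) for a distribution q over a."""
    total = 0.0
    for a, b in pairs:
        joint_q = q[a] * P_b_given_a[a][b]
        posterior = P_a[a] * P_b_given_a[a][b] * lik[b] / evidence
        if joint_q > 0.0:
            total += joint_q * math.log(joint_q / posterior)
    return total


# Candidate minimiser: q*(a) proportional to P(a) exp{ sum_b P(b|a) log P(c_hat|b) }.
weights = [
    P_a[a] * math.exp(sum(P_b_given_a[a][b] * math.log(lik[b]) for b in range(2)))
    for a in range(3)
]
Z = sum(weights)
q_star = [w / Z for w in weights]
```

Evaluating `kl_to_posterior` at `q_star` and at other distributions over $a$ (for instance the prior `P_a`
itself) gives a quick way to see that no alternative attains a lower divergence on this example.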



Specifically assume that the minimised nested KL divergence for some time step $t+1$ is given by
$-\bar{\Psi}_{t+1}^{i+1}(x_{t+1})$ for some function $\bar{\Psi}_{t+1}^{i+1}$. Using the recursive formulation (23)
and applying (24) with $a = u_t|x_t$, $b = x_{t+1}$ and
$P(c = \hat{c}|b) = \exp\{\bar{\Psi}_{t+1}^{i+1}(x_{t+1})\} P(r_t = 1|x_t, u_t)$, it is easy to see that the new
policy is given by the Boltzmann like distribution,

$$ \pi^{i+1}(u_t|x_t) = \exp\{\Psi_t^{i+1}(x_t, u_t) - \bar{\Psi}_t^{i+1}(x_t)\} , \qquad (26) $$

with energy

$$ \Psi_t^{i+1}(x_t, u_t) = \log \pi^i(u_t|x_t) + \log P(r_t = 1|x_t, u_t) + \int_{x_{t+1}} P(x_{t+1}|x_t, u_t) \bar{\Psi}_{t+1}^{i+1}(x_{t+1}) \qquad (27) $$

and log partition function

$$ \bar{\Psi}_t^{i+1}(x_t) = \log \int_{u} \exp\{\Psi_t^{i+1}(x_t, u)\} . \qquad (28) $$



Thus we can obtain the result for iteration (21) by applying (27) backwards in time, with
$\bar{\Psi}_{T+1} = 0$ as the base case. Similarly asynchronous updates can be obtained by applying (27) only at
one time step.
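
The backward recursion (26)-(28) can be sketched for a tabular problem as follows. This is an illustrative
sketch only: it assumes finite state and control sets, the task likelihood (8) so that
$\log P(r_t = 1|x_t, u_t) = -C(x_t, u_t)$, a time-independent cost table, and a zero log partition function
beyond the horizon as the base case. The function names and the toy problem are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def policy_update(policy, P, C, T):
    """One sweep of (26)-(28): maps the time-indexed policy pi^i to pi^{i+1}."""
    n, m = len(C), len(C[0])
    psi_bar_next = [0.0] * n                      # assumed base case beyond the horizon
    new_policy = [None] * (T + 1)
    for t in range(T, -1, -1):
        psi_bar = [0.0] * n
        new_policy[t] = []
        for x in range(n):
            psi = [
                math.log(policy[t][x][u]) - C[x][u]            # log pi^i + log P(r=1|x,u)
                + sum(P[x][u][y] * psi_bar_next[y] for y in range(n))
                for u in range(m)
            ]
            psi_bar[x] = logsumexp(psi)                        # log partition function (28)
            new_policy[t].append([math.exp(v - psi_bar[x]) for v in psi])   # policy (26)
        psi_bar_next = psi_bar
    return new_policy


def expected_cost(policy, P, C, T, x0):
    """Expected total cost of a time-indexed stochastic policy started in x0."""
    n, m = len(C), len(C[0])
    J = [0.0] * n
    for t in range(T, -1, -1):
        J = [
            sum(policy[t][x][u] * (C[x][u] + sum(P[x][u][y] * J[y] for y in range(n)))
                for u in range(m))
            for x in range(n)
        ]
    return J[x0]


# Hypothetical two-state, two-control problem: P[x][u][y] = P(y | x, u).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
C = [[1.0, 2.0],
     [0.5, 0.0]]
T = 3
policy = [[[0.5, 0.5] for _ in range(2)] for _ in range(T + 1)]   # uniform pi^0

costs = []
for _ in range(20):
    costs.append(expected_cost(policy, P, C, T, x0=0))
    policy = policy_update(policy, P, C, T)
```

On this example the recorded `costs` form a non increasing sequence, which is the behaviour Proposition 2
predicts for iteration (21). The number of sweeps is kept small because the probabilities of suboptimal
controls shrink geometrically and would eventually underflow in this naive implementation.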

We now turn to the question of the behaviour of these updates in the limit. Let us define the following
restricted optimal policy $\hat{\pi}^*$.

Definition 1. Let $\mathcal{U}_t^0(x) \subseteq \mathcal{U}$ be the support of $\pi_t^0(\cdot|x)$ and let
$\mathcal{U}_t^*(x)$ be the optimal controls at time $t$ in state $x$. If
$\hat{\mathcal{U}}_t(x) = \mathcal{U}_t^*(x) \cap \mathcal{U}_t^0(x)$ is not empty, $\hat{\pi}_t^*(\cdot|x)$ is
defined as the uniform distribution over $\hat{\mathcal{U}}_t(x)$.

Although we do not have any formal results yet, we suggest the following preliminary conjecture which
we aim to complete in the near future.

Conjecture 5. Under weak assumptions, for both (21) and asynchronous updates,

• $\pi^i$ converges weakly to $\hat{\pi}^*$

• $\bar{\Psi}_t^i$ converges pointwise to $-J_t^* + c_t$, with $J_t^*$ the optimal value function and $c_t$ a constant



3.3.2 Infinite Horizon Case 

We will now consider the discounted infinite horizon setting. We proceed rather informally, but aim in future 
to formalise this setting as a limit case of the finite horizon setting. 

It is sufficient to only consider time stationary policies in this setting. Under such a policy the entire
process is time stationary, and, with a slight abuse of notation, we have

$$ q_\pi(x_{\ge 1}, u_{\ge 0} | x_0 = x) = q_\pi(x_{\ge 2}, u_{\ge 1} | x_1 = x) . \qquad (29) $$

It is now easy to show that

$$ KL(q_{\pi^{i+1}}(x_{\ge 2}, u_{\ge 1} | x_1 = x) \| p_{\pi^i}(x_{\ge 2}, u_{\ge 1} | x_1 = x)) = \gamma KL(q_{\pi^{i+1}}(\bar{x}, \bar{u} | x_0 = x) \| p_{\pi^i}(\bar{x}, \bar{u} | x_0 = x)) , \qquad (30) $$

which leads to the time stationary analog of (27),

$$ \Psi^{i+1}(x, u) = \log \pi^i(u|x) + \log P(r = 1|x, u) + \gamma \int_{y} P(y|x, u) \bar{\Psi}^{i+1}(y) . \qquad (31) $$



However due to the form of $\bar{\Psi}^{i+1}$, this does not yield $\Psi^{i+1}$ directly. Therefore we propose, in
analogy to value iteration, e.g., [35], the update

$$ \Psi^{i+1}(x, u) \leftarrow \Psi^i(x, u) - \bar{\Psi}^i(x) + \log P(r = 1|x, u) + \gamma \int_{x'} P(x'|x, u) \bar{\Psi}^i(x') , \qquad (32) $$

which corresponds to the assumption that after one step the old policy
$\pi^i = \exp\{\Psi^i(x, u) - \bar{\Psi}^i(x)\}$ is followed. Although this update does not correspond to the
iteration of (21), it can be constructed from a specific schedule of asynchronous updates. Specifically consider
the schedule of time steps $t^{j,k}$, where for each $j = 1, 2, \ldots$ updates are performed at
$k = j, j-1, j-2, \ldots, 0$. It is easy to see that after each update the first step policy $\pi_0$ equals the
policy obtained from (32). Hence as this iteration falls into the class of Proposition 3 we immediately obtain
the guarantee of non increasing expected costs and convergence. Furthermore we anticipate that the schedule will
satisfy the weak conditions of Conjecture 5 and its convergence to an optimal policy will directly follow.
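
A tabular sketch of update (32) is given below. It assumes finite state and control sets, the task likelihood
(8) so that $\log P(r = 1|x, u) = -C(x, u)$, and an all-zero initial energy table, which corresponds to a
uniform initial policy. The function names and the two-state problem are hypothetical; the behaviour on this
example illustrates, but does not prove, the anticipated convergence.

```python
import math


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def psi_update(psi, P, C, gamma):
    """One application of update (32), using log P(r = 1 | x, u) = -C(x, u)."""
    n, m = len(C), len(C[0])
    psi_bar = [logsumexp(psi[x]) for x in range(n)]      # log partition per state
    return [
        [
            psi[x][u] - psi_bar[x] - C[x][u]
            + gamma * sum(P[x][u][y] * psi_bar[y] for y in range(n))
            for u in range(m)
        ]
        for x in range(n)
    ]


def policy_from_psi(psi):
    """Boltzmann like policy pi(u | x) = exp{Psi(x, u) - Psi_bar(x)}."""
    return [[math.exp(v - logsumexp(row)) for v in row] for row in psi]


# Hypothetical two-state, two-control problem: P[x][u][y] = P(y | x, u).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.7, 0.3], [0.1, 0.9]]]
C = [[1.0, 2.0],
     [0.5, 0.0]]
gamma = 0.9

psi = [[0.0, 0.0], [0.0, 0.0]]                           # uniform initial policy
for _ in range(200):
    psi = psi_update(psi, P, C, gamma)

policy = policy_from_psi(psi)
greedy = [max(range(2), key=lambda u: psi[x][u]) for x in range(2)]
```

On this toy problem the policy concentrates on the second control in both states, which is also the control
selected by the Bellman optimality equation (6) for these costs and transitions. Only the energies are stored,
so no explicit model of the policy is needed between iterations.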
3.4 Relation to Previous Work



In the following we will relate the presented work in greater detail to three recent developments in the field.
However we note that attempts to relate stochastic optimal control to inference have a long history, in part
motivated by the exact duality for the linear-quadratic-Gaussian (LQG) case discovered by Kalman [27].
In general the idea of replacing costs, utilities or rewards by an auxiliary binary random variable has a long
history [3[5S1[S]. Shachter & Peot even mention work by Raiffa (1969) and von Neumann & Morgenstern
(1947) in this context. Approaches have, however, varied between using the interpretation of cost as energy,
together with the typical identification of energy with negative log probability, as has been done here, and
choosing probabilities which are proportional to the reward or utility.



3.4.1 Approximate Inference Control 

As the approximate inference control (AICO) framework was discussed in detail in the 1st Year Report we
will refrain from a full description here. It suffices to recall that AICO is formulated within the model
described in subsection 3.1 and aims to find an approximation to $p_{\pi^0}$ by a message passing approach similar
to Expectation Propagation [16]. Although the original work [33] observed a close relation of the messages
in the LQG case to the classical Riccati equations [27], no claims were made regarding stochastic optimality
and it was suggested to choose the maximum a posteriori (MAP) controls. With the results presented in
subsection 3.2 the relation of AICO to stochastic optimal control can now be clarified. Specifically a possible
interpretation for AICO is to see it as finding an approximation to $p_{\pi^0}$, such as to make the KL divergence of
Proposition 1 tractable. However, even under this interpretation we note that the result of the minimization
of the KL divergence, even under a Gaussian approximation to $p_{\pi^0}$, is not the MAP control; rather one
should solve the Riccati equation arising from the approximation.



3.4.2 Path Integral and KL control 

In recent years several groups were independently able to show that for a restricted class of stochastic optimal
control problems the minimized Bellman equation becomes linear and the problem admits a solution in
closed form jTHl [13] [31] [12]. These linear Bellman equations can be seen as a KL divergence [12], leading to
a close relation to the formulation in Proposition 1. We will demonstrate this close relation in the discrete
time case, leaving the continuous time case for future consideration as we have not yet developed it in our
framework.

Let us briefly recall the KL control framework of Kappen et al. [15], the alternative formulations of
Todorov [30, 31] being equivalent. Choose some free dynamics $\nu_0(x_{t+1}|x_t)$ and let the cost be given as

$$ C(\bar{x}) = \ell(\bar{x}) + \log \frac{\nu(\bar{x})}{\nu_0(\bar{x})} \qquad (33) $$

where $\nu(x_{t+1}|x_t)$ is the controlled process under some policy. Then

$$ \langle C(\bar{x}) \rangle_{\nu} = KL(\nu(\bar{x}) \| \nu_0(\bar{x}) \exp\{-\ell(\bar{x})\}) \qquad (34) $$

which is minimised w.r.t. $\nu$ by

$$ \nu(x_{1 \ldots T}|x_0) = \frac{1}{Z(x_0)} \exp\{-\ell(x_{1 \ldots T})\} \nu_0(x_{1 \ldots T}|x_0) \qquad (35) $$

and one concludes that the optimal control is given by $\nu(x_{t+1}|x_t)$, where presumably the implied meaning
is that $\nu(x_{t+1}|x_t)$ is the trajectory distribution under the optimal policy.



Although (35) gives a process which minimises (34), it is not obvious how to compute actual controls
from this process. Specifically when given a model of the dynamics, $P(x_{t+1}|x_t, u_t)$, and having chosen some
$\nu_0$, a non trivial, yet implicitly made, assumption is that

$$ \exists \pi \ \text{s.t.} \ \nu(x_{t+1}|x_t) = \int_{u_t} P(x_{t+1}|x_t, u_t) \pi(u_t|x_t) . \qquad (36) $$

In fact in general such a $\pi$ will not exist. This is made very explicit for the discrete MDP case in [30],
where it is acknowledged that the method is only applicable if the dynamics are fully controllable, i.e.,
$P(x_{t+1}|x_t, u_t)$ can be brought into any arbitrary form by the controls. Although in the same paper it
is suggested that solutions to classical problems can be obtained by continuous embedding of the discrete
MDP, such an approach has several drawbacks. For one it requires solving a continuous problem even for
cases which could have been otherwise represented in tabular form, but more importantly such an approach
is obviously not applicable to problems which already have continuous state or action spaces. In the latter
case Kappen et al. claim (cf. section 4 of [12]) that the KL control approach is applicable if the problem is
of the following form

$$ x_{t+1} = F(x_t) + B(x_t)(u_t + \xi), \quad \xi \sim \mathcal{N}(0, Q), $$
$$ C_t(x_t, u_t) = \ell(x_t) + u_t^T H u_t , \qquad (37) $$

with $F$, $B$ and $\ell$ having arbitrary form, but $H$, $Q$ such that $H^{-1} \propto Q$. We dispute this claim,
showing that, in the discrete time case, (36) is not fulfilled and that correcting the problem leads to an
equivalent of Proposition 1.

It will be sufficient to consider the simplest possible case of a one dimensional, one time step LQG
problem. Let

$$ P(x_{t+1}|x_t, u_t) = \mathcal{N}(x_{t+1} | x_t + u_t; \Sigma) \qquad (38) $$

and

$$ C_t(x_t, u_t) = x_t R x_t + u_t \Sigma^{-1} u_t . \qquad (39) $$

The claim made by Kappen et al. is that for $\nu_0 = P(x_{t+1}|x_t, u_t = 0)$ the KL formulation is equivalent to
the corresponding stochastic optimal control problem, or more specifically that

$$ \nu(x_1|x_0) \propto P(x_1|x_0, u_0 = 0) \exp\{-x_1 R x_1\} = \mathcal{N}(x_1 | x_0, \Sigma^{-1} + R) \qquad (40) $$



gives the optimal controls, hence (36) should hold. In particular, as we know the LQG problem has a unique
deterministic stochastic optimal control solution [27], there should be a deterministic $\pi$ s.t. (36) holds. But
notice that we can not influence the variance of $P(x_{t+1}|x_t, u_t)$ by specific choices of a deterministic
$\pi$, hence (36) does not hold. Specifically $\nu$ is not the trajectory distribution under the optimal policy.
In fact there may not even be a stochastic policy s.t. (36) holds. Consider the case when the cost 'variance'
$R^{-1}$ is smaller than the variance of the noise, $\Sigma$. Then $\nu(x_1|x_0 = 0)$ will have variance smaller
than $\Sigma$. But even though with a stochastic policy the variance of the marginal process can increase, it can
not decrease.
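
The variance argument can be illustrated with two closed form expressions for the scalar case. The sketch
assumes the Gaussian transition (38), a state cost factor $\exp\{-R x_1^2\}$ with scalar $R > 0$, and a
stochastic policy whose control is drawn independently of the process noise; the function names are
hypothetical.

```python
def tilted_variance(noise_var, R):
    """Variance of a density proportional to N(x1 | x0, noise_var) * exp(-R * x1**2).

    The exponents add, so the precisions add: 1 / noise_var + 2 * R.
    """
    return 1.0 / (1.0 / noise_var + 2.0 * R)


def marginal_variance(noise_var, control_var):
    """Variance of x1 = x0 + u + noise when u has variance control_var and is
    independent of the noise: the variances add, so the result is at least noise_var."""
    return noise_var + control_var
```

For any $R > 0$ the first expression is strictly below the noise variance, whereas the second is never below
it, so under these assumptions no choice of control distribution makes the two agree.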

The question now arises what controls should we choose? A principled choice would be to choose $\pi$ to
minimise a KL divergence. The first intuition would be to take

$$ \operatorname{argmin}_{\tau} KL(P(x_{t+1}|x_t, u_t = \tau(x_t)) \| \nu(x_{t+1}|x_t)) . \qquad (41) $$

However noting that

$$ \nu(x_{t+1}|x_t) = \frac{1}{Z(x_t)} \nu_0(x_{t+1}|x_t) Z(x_{t+1}) \qquad (42) $$
$$ = \frac{1}{Z(x_t)} \nu_0(x_{t+1}|x_t) \langle \exp\{-\ell(x_{t+1:T})\} \rangle_{\nu_0(x_{t+2:T}|x_{t+1})} , \qquad (43) $$

the KL divergence can be written as

$$ KL(P(x_{t+1}|x_t, u_t = \tau(x_t)) \| \nu_0(x_{t+1}|x_t)) - \langle \log Z(x_{t+1}) \rangle_{P(x_{t+1}|x_t, u_t = \tau(x_t))} + \log Z(x_t) . \qquad (44) $$

This is the correct expression for the expected cost if $-\log Z$ is the value function, however the latter is
only the case if the normalized form of the KL divergence in (34) becomes zero at the minimum. Here we are
specifically assuming this not to be the case, implying this formulation does not lead to stochastic optimal
controls, and we are therefore compelled to take

$$ \operatorname{argmin}_{\pi} KL(q_\pi(\bar{x}) \| \nu(\bar{x})) . \qquad (45) $$

This is very similar to the KL divergence in Proposition 1. In fact, under the conditions of the corollary to
Proposition 1 and if the problem is of the form in (37), we can write



Pna(x,u) = v(x) J|exp{(> f+1 - x t - T{x t )H 1 u t - ^u t H 1 u t )} (46) 



and the KL divergence of Proposition 1 can alternatively be written as

KL (q n \\ Pw o) = KL (q„(x)\u(x)) - (^(ast+i ~x t - f{x t )T.- l u t - ^UtH^m)) . (47) 



Furthermore as for a deterministic policy, i.e. $\pi(u_t|x_t) = \delta_{u_t = \tau(x_t)}$,

$$ \langle x_{t+1} - x_t - F(x_t) \rangle_{q_\pi} = \langle u_t \rangle_{q_\pi} = \tau(x_t) , \qquad (48) $$

we can see that the second term is zero under the condition $H^{-1} = 2\Sigma$, i.e. under the conditions required
by Kappen et al., and (45) is equivalent to the formulation in Proposition 1.



3.4.3 Expectation Maximization Approaches 

Several suggestions for mapping the stochastic optimal control problem onto a maximum likelihood problem 
and using Expectation Maximization (EM) have been recently made in the literature [34], [2]. Going further
back the probability matching approach [8, 24] is also closely related to expectation maximization procedures.

As one may suspect when considering (21) our approach has a close relation to the free energy view of 
EM [THIH]. In this view, EM alternates between minimizing $\mathrm{KL}\left(q(z) \,\|\, P(z|y;\theta)\right)$ w.r.t. $q$, where $z, y$ are the
latent and observed variables and $\theta$ the parameters, and maximizing the free energy, defined as

$$\mathcal{L}(q,\theta) = \int_z q(z) \log \frac{P(z,y;\theta)}{q(z)} \tag{49}$$

w.r.t. $\theta$. In our case $z$ and $y$ correspond to $x, u$ and $\bar{r}$, while $\theta$ corresponds to $\pi$. It is easy to see that (21), or
(22) for that matter, correspond to a generalized E-Step. Here "generalized" indicates that only a partial step
is performed, i.e., we are only lowering, rather than minimizing, $\mathrm{KL}\left(q(z) \,\|\, P(z|y;\theta)\right)$ w.r.t. $q$. Furthermore
the choice of $\pi^{i+1}$ corresponds to a generalized M-Step, as



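\begin{align}
\mathcal{L}(q_{\pi^{i+1}}, \pi = \pi^{i+1}) &= \int_{x,u} q_{\pi^{i+1}} \log P(\bar{r}=1|x,u) \tag{50}\\
&\geq \int_{x,u} q_{\pi^{i+1}} \log P(\bar{r}=1|x,u) - \mathrm{KL}\left(q_{\pi^{i+1}} \,\|\, q_{\pi^{i}}\right) \tag{51}\\
&= \mathcal{L}(q_{\pi^{i+1}}, \pi = \pi^{i}) . \tag{52}
\end{align}

Hence we conclude that our method corresponds to a generalized EM algorithm.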

Although we have shown that one can interpret the proposed approach in terms of EM we emphasise that 
it differs significantly from the applications of EM in previous work. For one we note that it is not our aim 
to find the maximum likelihood policy and in fact as we are using a generalized E-Step we lose the guarantee 
of convergence to a local maximum of the likelihood. In general maximizing the marginal log likelihood, the 
objective of EM, would not be desirable in our model, as despite the fact that for a given state and control 
trajectory, the classical cost and the task likelihood are directly related by 

$$C(x,u) = -\log P(\bar{r}=1|x,u) , \tag{53}$$

no such direct equality relation for the marginal likelihood can be obtained. Although using Jensen's
inequality we may obtain

$$\left\langle C(x,u) \right\rangle_{q_\pi(x,u)} \geq -\log P(\bar{r}=1) , \tag{54}$$

this bound, which has also been previously observed in [33], is not necessarily tight and hence the
stochastic optimal solution does not necessarily coincide with the maximum likelihood solution.
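To make the looseness of this Jensen bound concrete, the following minimal sketch compares the expected cost with the negative log marginal likelihood for two trajectory distributions. Both distributions, their probabilities and their costs are hypothetical and chosen purely for illustration; the only ingredient taken from the text is the relation (53) between cost and task likelihood.

```python
import math

def expected_cost(probs, costs):
    """Expected trajectory cost under the trajectory distribution."""
    return sum(p * c for p, c in zip(probs, costs))

def neg_log_marginal_likelihood(probs, costs):
    """-log P(r=1), with P(r=1 | trajectory) = exp(-cost) as in (53)."""
    return -math.log(sum(p * math.exp(-c) for p, c in zip(probs, costs)))

# Hypothetical policy A: a single trajectory with moderate cost.
probs_a, costs_a = [1.0], [1.0]
# Hypothetical policy B: usually cheap, occasionally very expensive.
probs_b, costs_b = [0.9, 0.1], [0.5, 10.0]

cost_a = expected_cost(probs_a, costs_a)
cost_b = expected_cost(probs_b, costs_b)
nll_a = neg_log_marginal_likelihood(probs_a, costs_a)
nll_b = neg_log_marginal_likelihood(probs_b, costs_b)

# Gap between the two sides of the bound for each policy.
gap_a = cost_a - nll_a
gap_b = cost_b - nll_b
```

In this toy setting policy B attains the larger marginal likelihood although policy A has the lower expected cost, so ranking policies by likelihood and by expected cost need not agree.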

A more fundamental difference is that we can avoid finding an explicit representation for $q_\pi$. In both the
approaches of [M] and [5] calculating $q_\pi$ explicitly is a major computational step and presents a problem if
these methods were to be applied to the continuous setting, where $q_\pi$ may not be analytically tractable.

Finally we anticipate that Conjecture 5 will hold, giving a guarantee of convergence to an optimal policy
which other EM methods cannot provide.

3.5 Conclusion 

The contribution of this section is a novel interpretation of the stochastic optimal control problem as an
approximate inference problem and the derivation of an iterative solution to the control problem based on
this new interpretation. The proposed approach has also been shown to have interesting links to other
current research directions in the field. In particular we demonstrate that the approach can be understood
to underlie both the approximate inference control framework and, in the time discrete setting, the KL
control framework. This theoretical work is intended to provide the foundation for the remainder of this
paper and future work.

4 Reinforcement Learning 

So far we have assumed the transition model and cost function are readily available. We now turn to the 
reinforcement learning setting [TTJ [351 [55] , where one aims to learn a good policy only given samples from 
the transition probability and associated incurred costs. 

We will demonstrate how the theoretical results previously derived can be applied to such problems 
yielding algorithms which are both model free and off policy. Model free indicates that the algorithm does 
not construct an explicit representation of the transition probability and cost function but rather directly 
learns a representation of the optimal policy. Off policy on the other hand means that the optimal policy 
can be learnt from samples collected under a different, often sub-optimal, policy. 

We will proceed by first deriving a tabular algorithm which is applicable for problems with small, finite, 
state and control spaces, before subsequently extending it to problems with continuous state and control 
spaces by using approximate parametric representations. Both algorithms are applied to classical problems
in the field.

4.1 Finite Problems 

Let us consider problems in the infinite horizon discounted cost setting and recall that the update function
for $\Psi$ suggested in subsection 3.3 for this case was

$$\Psi(x,u) \leftarrow \Psi(x,u) - \bar\Psi(x) + \log P(\bar{r}=1|x,u) + \gamma \int_{x'} P(x'|x,u)\,\bar\Psi(x') . \tag{55}$$

For any given $x, u$ this update can be written as an expectation w.r.t. the transition probability $P(y|x,u)$,
and hence may be approximated from a set of sampled transitions. In particular given a single sample
$(x, u, \ell, y)$ of a transition from $x$ to $y$ under control $u$ incurring cost$^4$ $\ell$ we may perform the approximate
update

$$\Psi(x,u) \leftarrow \Psi(x,u) + \left[\gamma\bar\Psi(y) - \bar\Psi(x) - \ell\right] . \tag{56}$$

Given a stream of samples $x_0, u_0, \ell_0, x_1, u_1, \ell_1, \ldots$ we can then apply such an update for each tuple $(x_t, u_t, \ell_t, x_{t+1})$
individually. Without a particular justification we can furthermore introduce a decaying learning rate
parameter, similar to other reinforcement learning algorithms, in order to damp these updates. In practice
however we did not find such a learning rate to improve results significantly. We call the resulting algorithm
$\Psi$-learning. As indicated previously it is model free and can be employed for off policy learning.
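The sampled update (56) can be sketched for a finite problem as follows. This is only an illustrative sketch under assumptions that are not part of the text: the table is a dictionary keyed by (state, control), every state shares the same finite control set, and the log partition function is taken to be the log of a sum of exponentials over that set. The function names are hypothetical.

```python
import math
from collections import defaultdict

def log_partition(psi, state, controls):
    """Log of the sum over controls of exp(psi[state, u]), computed stably."""
    values = [psi[(state, u)] for u in controls]
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

def psi_learning_update(psi, controls, x, u, cost, y, gamma=1.0):
    """Apply the sampled update (56) in place for one transition (x, u, cost, y)."""
    delta = gamma * log_partition(psi, y, controls) - log_partition(psi, x, controls) - cost
    psi[(x, u)] += delta
    return delta

controls = ["left", "right"]
psi = defaultdict(float)  # table initialised to zero everywhere
delta = psi_learning_update(psi, controls, x=0, u="right", cost=1.0, y=1)
```

With an all-zero table the two log partition terms cancel for an undiscounted problem, so the first update simply subtracts the observed cost from the visited entry.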

4 n.b. we assume we observe the cost, i.e., $\ell = -\log P(\bar{r}=1|x,u)$.
4.1.1 Relation to Classical Algorithms

Before proceeding let us highlight certain similarities and differences between $\Psi$-learning and two classical
algorithms, Q-learning and TD(0) [25].

As the name indicates, Q-learning learns the state-action value function (cf. Equation Q). We note 
that $\Psi$ has certain similarities to a Q function, in the sense that a higher value of $\Psi$ for a certain control
in a given state indicates that the control is 'better'. In fact for the optimal controls the Q function and
$\Psi$ converge to the same value$^5$. However unlike the Q function, which also converges to the expected cost for
the sub-optimal controls, $\Psi$ goes to $-\infty$ for sub-optimal actions. A potentially more insightful difference
between the two algorithms is the nature of the updates employed. The Q-learning algorithm uses updates of the
form

$$Q(x,u) \leftarrow Q(x,u) + \alpha\left[\ell + \gamma \max_{u'} Q(y,u') - Q(x,u)\right] , \tag{57}$$

where $\alpha$ is a learning rate. Note that it will employ only information from one current control and the best,
according to current knowledge, future control. The $\Psi$-learning algorithm on the other hand uses $\bar\Psi$, which
in some sense averages over information about the future according to the current belief about the control
distribution, rather than using single $\Psi$ values.
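The difference between the two kinds of backup can be seen on a single state. The sketch below contrasts a maximum over table entries with a log-sum-exp over the same entries; the numerical values are hypothetical, and identifying the averaging backup with a log-sum-exp over a finite control set is an assumption made for this illustration.

```python
import math

def hard_backup(values):
    """Backup that uses only the single best entry."""
    return max(values)

def soft_backup(values):
    """Log-sum-exp backup: every entry contributes, weighted exponentially."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))

values = [-1.0, -1.2, -5.0]      # hypothetical entries for three controls
changed = [-1.0, -3.0, -5.0]     # same best entry, worse runner-up

hard, soft = hard_backup(values), soft_backup(values)
hard_changed, soft_changed = hard_backup(changed), soft_backup(changed)
```

Changing the runner-up entry leaves the hard backup untouched but shifts the soft backup, which is the sense in which the latter pools information over all controls.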

A connection to the TD(0) algorithm which learns a value function is given by the form of the update. 
The TD(0) update has the form 

$$J(x) \leftarrow J(x) + \alpha\left[\ell + \gamma J(y) - J(x)\right] \tag{58}$$

with $\alpha$ again a learning rate. We observe that as by Conjecture 5 $\bar\Psi$ converges, up to an additive constant,
to the value function of the optimal policy, the $\Psi$-learning update converges towards the TD(0) update for
samples generated under the optimal policy. The emphasis is on convergence towards the TD(0) update; in
general it will not correspond to a TD(0) update. In particular an important difference between the two
algorithms is that TD(0) is an on-policy method, that is, it learns the value function of the policy used to
generate samples, while the proposed $\Psi$-learning is off-policy.

4.1.2 Results 

Problems with finite state and action spaces allow $\Psi$ to be represented directly in tabular form. We evaluated
such a tabular $\Psi$-learning algorithm on the grid world domain [28]. Specifically we used the following task
formulation. The state space is given by an $N \times N$ grid with some states occupied by obstacles. The controls
allow the agent to transition to any neighbouring state not occupied by an obstacle or to remain at the
current state. A transition to a neighbouring state succeeds with probability 0.8, with the agent remaining
at the current location in case of failure. Choosing to remain in the current state succeeds with probability
1. Additionally a set $A \subset \mathcal{X}$ of absorbing target states, i.e., $P(x_{t+1} \in A \,|\, x_t \in A, u \in \mathcal{U}) = 1$, is defined. In
every time step a cost of 1 is incurred if the agent is in any state which is not a target state, while at a target
state no cost is incurred, i.e., $C(x,u) = \delta_{x \notin A}$ with $\delta$ the Kronecker delta. The cost was not discounted, i.e.,
$\gamma = 1$.

We used tabular Q-learning, e.g., [28], as a baseline. Both algorithms were run with controls sampled
from an uninformed policy, i.e. a uniform distribution over the controls available at a state. Once a target
state was reached, or if the target wasn't reached within 100 steps, the state was reset randomly. The
learning rate for Q-learning decayed as $\alpha = c/(c+t)$ with $t$ the number of transitions sampled and $c$ a
constant which was optimised manually.
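A transition sampler for a grid world of this kind might look as follows. This is a sketch under several assumptions that the task description leaves open: states are (row, column) tuples, a control is an offset such as (0, 1), and a move into an obstacle or off the grid is treated as staying in place (the text instead excludes such controls). All names are hypothetical.

```python
import random

def grid_step(state, move, obstacles, targets, size, rng, p_success=0.8):
    """Sample one transition and return (next_state, cost)."""
    if state in targets:
        return state, 0.0          # target states are absorbing and free of cost
    cost = 1.0                     # unit cost in every non-target state
    candidate = (state[0] + move[0], state[1] + move[1])
    inside = 0 <= candidate[0] < size and 0 <= candidate[1] < size
    legal = inside and candidate not in obstacles
    # A legal move to a neighbour succeeds with probability p_success;
    # otherwise, and for the "remain" control (0, 0), the agent stays put.
    if move != (0, 0) and legal and rng.random() < p_success:
        return candidate, cost
    return state, cost

rng = random.Random(0)
next_state, cost = grid_step((0, 0), (0, 1), obstacles=set(), targets={(2, 2)}, size=3, rng=rng)
```

Passing the random number generator explicitly keeps sampled runs reproducible, which is convenient when averaging learning curves over several trials.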

Representative results for a single instance of the general task are given in figure 2. We report the
approximation error

$$e_J = \frac{\max_x |J(x) - \hat{J}(x)|}{\max_x J(x)} \tag{59}$$



5 n.b. at the moment this is a conjecture, as it is a consequence of Conjecture 5.
Figure 2: Results for tabular $\Psi$-learning on an example grid world problem. (a) The optimal value function
(white: low expected cost, black: high expected cost) of the problem. Obstacles are black and the target state
is indicated by *. (b) Evolution of the mean error in (59) averaged over 10 trials for each of the algorithms.
Error bars indicate the standard deviation.



between the true value function $J$, obtained by value iteration, and its estimate $\hat{J}$, given by $\bar\Psi$ and
$\max_u Q(x,u)$ respectively. Both algorithms achieved the same error at convergence. However $\Psi$-learning
consistently outperformed Q-learning in terms of the number of samples required to converge. We
additionally considered a greedy variant of $\Psi$-learning where the controls are sampled from the policy given by
the current $\Psi$, i.e. $\pi(u|x) = \exp\{\Psi(x,u) - \bar\Psi(x)\}$. As expected we found that the greedy version greatly
outperformed sampling using an uninformed policy.
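Sampling controls from the policy implied by the current table, as in the greedy variant, can be sketched as below. The dictionary layout (one row of table entries per state, keyed by control name) and the numerical values are hypothetical; the normaliser is assumed to be the log of a sum of exponentials over the finite control set.

```python
import math
import random

def policy_from_psi(psi_row):
    """Return the distribution exp(psi(u) - log sum_u' exp(psi(u'))) for one state."""
    top = max(psi_row.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in psi_row.values()))
    return {u: math.exp(v - log_norm) for u, v in psi_row.items()}

def sample_control(psi_row, rng):
    """Draw one control from the policy implied by psi_row."""
    probs = policy_from_psi(psi_row)
    controls = list(probs)
    return rng.choices(controls, weights=[probs[u] for u in controls], k=1)[0]

psi_row = {"stay": -3.0, "north": -1.0, "east": -1.0}  # hypothetical entries
pi = policy_from_psi(psi_row)
control = sample_control(psi_row, random.Random(1))
```

Controls whose entries have drifted towards large negative values receive vanishing probability, so the sampler concentrates on the controls currently believed to be good.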



4.2 Continuous problems 

For continuous control problems, i.e. those with infinite state or control sets, storing $\Psi$ in tabular form
clearly becomes impossible and even for discrete problems it may be impracticable due to the size of the table
required. In such cases, it is common to resort to parametric representations [29], and here we follow such
an approach to extend $\Psi$-learning to continuous problems. Although we will concentrate on the continuous
case, we note that the proposed approach could also be employed for large discrete problems.



4.2.1 The LS$\Psi$ algorithm

Similar to numerous previous approaches [5, 20, 28, 29] we used a linear basis function model to approximate
$\Psi$, i.e.,

$$\Psi(x,u) \approx \tilde\Psi(x,u,w) = \sum_{i=1}^{M} w_i \phi_i(x,u) \tag{60}$$

where $\phi_i : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ are a set of given basis functions and $w = (w_1, \ldots, w_M)$ is the vector of parameters
we learn. For such an approximation and a given set of samples $(x_{1\ldots K}, u_{1\ldots K}, \ell_{1\ldots K}, y_{1\ldots K})$, the $\Psi$-learning
update (56) can be written in matrix notation as

$$\Phi w^{i+1} = \Phi w^{i} + z , \tag{61}$$

where $\Phi$ is the $K \times M$ matrix with entries $\Phi_{j,i} = \phi_i(x_j,u_j)$ and $z$ is the vector with elements

$$z_k = \gamma\bar\Psi(y_k) - \ell_k - \bar\Psi(x_k) . \tag{62}$$

From this we can obtain

$$w^{i+1} - w^{i} = (\Phi^\top\Phi)^{-1}\Phi^\top z , \tag{63}$$

which suggests the update rule

$$w \leftarrow w + (\Phi^\top\Phi)^{-1}\Phi^\top z . \tag{64}$$



This is equivalent to computing the $\Psi$-learning update of (56) for the current approximation and projecting
the result onto the space spanned by the basis functions in the least squares sense, an approach which has
seen repeated use in reinforcement learning [T31 US]. We call the resulting algorithm, which, like
tabular $\Psi$-learning, is a model free, off-policy method, Least Squares $\Psi$-learning (LS$\Psi$).
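The least squares projection step can be sketched without any numerical library by solving the normal equations directly. The sketch assumes that the feature matrix and the vector z have already been formed from a batch of samples; the helper names and the tiny data set are hypothetical, and a practical implementation would more likely call a library least squares routine.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular; more samples are needed")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def ls_psi_step(phi, z, w):
    """Return w + (Phi^T Phi)^{-1} Phi^T z for a list-of-rows matrix phi."""
    m = len(w)
    gram = [[sum(row[i] * row[j] for row in phi) for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * z_k for row, z_k in zip(phi, z)) for i in range(m)]
    step = solve_linear_system(gram, rhs)
    return [w_i + s_i for w_i, s_i in zip(w, step)]

# Hypothetical batch: K = 3 samples, M = 2 basis functions.
phi = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z = [1.0, 2.0, 3.0]
w_new = ls_psi_step(phi, z, [0.0, 0.0])
```

In this example z lies exactly in the span of the features, so the projection recovers the step (1, 2) without residual.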

The choice of basis functions for LS$\Psi$ is somewhat complicated by the need to evaluate the log partition
function of the policy, $\bar\Psi$, i.e. $\log \int_u \exp\{\Psi(x,u)\}$, when forming the vector $z$. In cases where $\mathcal{U}$ is a finite set,
arbitrary basis functions can be chosen as the integral reduces to a finite sum. However for problems with
infinite control spaces one needs to ensure the bases are chosen such that the arising integral is analytically
tractable, i.e. the partition function of the stochastic policy can be evaluated. One class of basis sets for
which this is the case are those for which $\tilde\Psi(x,u,w)$ has the form

$$\tilde\Psi(x,u,w) = -\tfrac{1}{2} u^\top \mathbf{K}(x,w)\, u + u^\top \mathbf{k}(x,w) + k(x,w) \tag{65}$$

where $\mathbf{K}(x,w)$ is a positive definite matrix. For such a set the integral is of the Gaussian form and the
closed form solution

$$\log \int_u \exp\{\tilde\Psi\} = -\tfrac{1}{2} \log|\mathbf{K}| + \tfrac{1}{2} \mathbf{k}^\top \mathbf{K}^{-1} \mathbf{k} + k + \text{constant} \tag{66}$$

is obtained. Obviously the implication of such a basis set is that the policies are restricted to conditional
Gaussian distributions. Specifically the policy is given by

$$\pi(u|x,w) = \mathcal{N}(u \,|\, \mathbf{K}^{-1}\mathbf{k}, \mathbf{K}^{-1}) . \tag{67}$$

Such Gaussian policies are commonly employed in the continuous reinforcement learning setting, e.g., [Bll20j. 
and we emphasise that, as the state dependent part of the basis is largely unrestricted, this general class of 
basis sets does not seem unreasonably restrictive. 
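For a scalar control the Gaussian integral can be checked numerically. The sketch below evaluates the closed form of the log partition function for a quadratic exponent and compares it with a midpoint-rule integral; the coefficient values, the integration range and the function names are hypothetical, and the additive constant is written out explicitly for the one-dimensional case.

```python
import math

def gaussian_log_partition(K, k, k0):
    """Log of the integral over u of exp(-0.5*K*u**2 + k*u + k0), scalar u, K > 0."""
    return -0.5 * math.log(K) + 0.5 * k * k / K + k0 + 0.5 * math.log(2.0 * math.pi)

def gaussian_policy(K, k):
    """Mean and variance of the Gaussian policy implied by the quadratic form."""
    return k / K, 1.0 / K

def numeric_log_partition(K, k, k0, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule approximation of the same integral, used as a check."""
    h = (hi - lo) / steps
    total = sum(math.exp(-0.5 * K * u * u + k * u + k0)
                for u in (lo + (i + 0.5) * h for i in range(steps)))
    return math.log(total * h)

K, k, k0 = 2.0, 1.0, -0.3   # hypothetical coefficients
closed = gaussian_log_partition(K, k, k0)
numeric = numeric_log_partition(K, k, k0)
mean, variance = gaussian_policy(K, k)
```

The agreement of the two values illustrates why restricting the control dependence to a quadratic form keeps the vector z cheap to compute.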



4.2.2 Results 

We demonstrate the applicability of LS$\Psi$ on a pole on cart task [33], which has been repeatedly used as a
benchmark in reinforcement learning [21, 23]. The task consists of balancing an inverted pendulum mounted
on a cart by exerting forces on the latter. The state space is given by $\mathbf{x} = (x, \dot{x}, \theta, \dot\theta)$, with $x$ the position
of the cart, $\theta$ the pendulum's angular deviation from the upright position and $\dot{x}, \dot\theta$ their respective temporal
derivatives. Following [20] we use a form of the dynamics linearised around the zero state. The approximate
dynamics are given by $P(\mathbf{x}_{t+1}|\mathbf{x}_t,u_t) = \mathcal{N}(\mathbf{x}_{t+1} \,|\, A\mathbf{x}_t + b u_t, \Sigma)$, where
(68)



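and $\tau = 1/60\,\mathrm{s}$, $\nu = 13.2\,\mathrm{s}^{-2}$, $g = 9.8\,\mathrm{m\,s}^{-2}$, $\Sigma = \mathrm{diag}(0.001, 0.001, 0.001, 0.001)$. The cost is given by
$C(\mathbf{x},u) = \mathbf{x}^\top Q \mathbf{x} + u^\top R u$, with $Q = \mathrm{diag}(1.25, 1, 12, 0.25)$ and $R = 0.01$, and was, unlike in [20], not discounted,
i.e., $\gamma = 1$.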
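The quadratic cost is simple enough to state in code. The sketch below exploits that Q is diagonal and the control is scalar; the function name and the tuple layout of the state (position, velocity, angle, angular velocity) are assumptions made for illustration.

```python
Q_DIAG = (1.25, 1.0, 12.0, 0.25)   # diagonal of Q as stated above
R = 0.01                           # scalar control weight as stated above

def cart_pole_cost(state, control):
    """Quadratic cost x^T Q x + u^T R u for diagonal Q and scalar control."""
    state_cost = sum(q * s * s for q, s in zip(Q_DIAG, state))
    return state_cost + R * control * control

example_cost = cart_pole_cost((0.0, 0.0, 0.1, 0.0), 2.0)
```

The large weight on the third component means that angular deviations dominate the cost, while the control effort is penalised only weakly.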
Figure 3: Results for LS$\Psi$ on the cart on pole task. (a) Evolution of the error in the policy, defined as the
L2 norm of the difference between learned and optimal gains, for 10 random trials. (b) The evolution of the
expected cost averaged over the 10 trials. The dashed line indicates the expected cost of the optimal policy.
Error bars indicate standard deviation. (c) Average length of episodes in the 10 trials, averaged over blocks
of 100 episodes.

As the problem is LQG a set of polynomial basis functions is sufficiently rich to capture it, and we applied
LS$\Psi$ with bases

$$\{u^2, ux, u\dot{x}, u\theta, u\dot\theta, x^2, x\dot{x}, x\theta, x\dot\theta, \dot{x}^2, \dot{x}\theta, \dot{x}\dot\theta, \theta^2, \theta\dot\theta, \dot\theta^2\} .$$



This set is of the form required by (65), if $w_1$ is negative. Although we employed no formal means to ensure
this, empirically we found that if initialised to a negative value, $w_1$ would remain negative. Specifically we
used the initialisation $w = (-0.1, 0, \ldots, 0)$ corresponding to an initial uninformed policy, i.e. a zero mean
Gaussian with large variance. Using a random initialisation, whilst ensuring $w_1$ was negative, did not affect
the results significantly, although convergence times increased if the initial policy was far from the optimum
and had a low variance. It is worth noting that the initial policy did not asymptotically stabilise the system,
this is in contrast to [HI H3Q 
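How a weight vector over these bases translates into a Gaussian policy can be sketched as follows. The ordering of the features follows the list of bases given above; the function names are hypothetical, and reading the precision and the linear term off the first five weights is an assumption that follows from matching the polynomial against the quadratic form in the control.

```python
def features(state, u):
    """The 15 polynomial bases: u^2, u times each state component, then state products."""
    control_terms = [u * u] + [u * s_i for s_i in state]
    state_terms = [state[i] * state[j] for i in range(4) for j in range(i, 4)]
    return control_terms + state_terms

def gaussian_policy_from_weights(w, state):
    """Mean and variance of the policy implied by a weight vector over the bases."""
    if w[0] >= 0.0:
        raise ValueError("the weight on u^2 must be negative for a normalisable policy")
    precision = -2.0 * w[0]   # matches w_1 * u^2 against -0.5 * K * u^2
    linear = sum(w_i * s_i for w_i, s_i in zip(w[1:5], state))  # coefficient of u
    return linear / precision, 1.0 / precision

w_init = [-0.1] + [0.0] * 14
mean, variance = gaussian_policy_from_weights(w_init, (0.3, 0.0, 0.1, 0.0))
```

For the initialisation quoted in the text this gives a zero mean and a variance of 5 in every state, i.e. a broad policy that does not depend on the state.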

We applied LS$\Psi$ following an episodic sampling procedure. Starting from a start state, drawn from
$\mathcal{N}(\mathbf{x}_0 \,|\, 0, \Sigma_0)$, with $\Sigma_0 = \mathrm{diag}(0.5, 0, 0.1, 0)$, a state, control and cost trajectory was sampled according to the
transition probability and cost function. The required controls were sampled according to the policy arising
from the current $w$, with a fixed baseline added to the variance. The latter proved necessary as otherwise
the updates tended to become numerically unstable once the policy began to converge. A trajectory was
terminated when it left the acceptable region given by

$$-\pi/6 < \theta < \pi/6 \quad \text{and} \quad -1.5\,\mathrm{m} < x < 1.5\,\mathrm{m} , \tag{69}$$

as in [20], or after 100 time steps. We updated $w$ after every 10 episodes.
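The episode termination logic can be sketched independently of the dynamics. In the code below the transition sampler and the policy are passed in as callables, both of which are hypothetical placeholders; treating the boundary of the acceptable region as already outside it is likewise an assumption.

```python
import math

MAX_STEPS = 100
THETA_LIMIT = math.pi / 6.0   # radians
X_LIMIT = 1.5                 # metres

def in_acceptable_region(state):
    """True while the state (x, x_dot, theta, theta_dot) satisfies (69)."""
    x, _, theta, _ = state
    return -THETA_LIMIT < theta < THETA_LIMIT and -X_LIMIT < x < X_LIMIT

def run_episode(step_fn, policy_fn, start_state):
    """Collect transitions until (69) is violated or MAX_STEPS have been taken."""
    state, transitions = start_state, []
    while len(transitions) < MAX_STEPS and in_acceptable_region(state):
        control = policy_fn(state)
        next_state, cost = step_fn(state, control)
        transitions.append((state, control, cost, next_state))
        state = next_state
    return transitions
```

The collected tuples have the same (state, control, cost, next state) layout as the samples used to form the least squares update, so a batch of episodes can be concatenated before each update of the weights.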

As the problem is LQG, the optimal policy is linear and can be computed directly. We can therefore assess
the behaviour of LS$\Psi$ directly, by measuring the error in the policy approximation during the learning process.
The results in figure 3(a), where we plot the policy error defined as the L2 norm of the difference between
the optimal gains and the LS$\Psi$ estimate, demonstrate that LS$\Psi$ can successfully find near optimal gains. A
metric of the quality of an RL algorithm more commonly reported in the literature is the expected cost
under the policy it learns. In figure 3(b) we therefore plot the evolution of expected costs. Note that as the
expected cost under certain policies for this problem is not finite, we plot a Monte Carlo estimate calculated
from a set of 100 trajectories with 200 steps each$^7$. As a reference we also plot the expected cost under the
optimal policy. This data confirms the results of the policy error analysis, i.e., that LS$\Psi$ converges towards
a near optimal policy. As an aside we note that these results are comparable in terms of the convergence
time to the best performing methods in [3T], where the same problem was used for evaluation. Unfortunately
we were, so far, not able to directly reproduce these results in order to obtain a direct comparison. The
similar convergence times are in particular surprising as LS$\Psi$ started with a substantially worse initial policy.
While [3T] seem to have constrained the initial policies to be stable, the initial LS$\Psi$ policy was unstable.
This initial instability of the controlled system is illustrated in figure 3(c), where we plot the average length
of the episodes used during learning. As can be seen the episodes under the initial policy are significantly
shorter than the maximum length, indicating that the constraints in (69) are frequently violated. However
after about 600-700 episodes a stabilising policy is learnt.

4.3 Conclusion 

The contribution of this section is a novel class of reinforcement learning algorithms, which we obtained by
direct application of the theoretical insights of section 3. We were able to demonstrate that the proposed
algorithm successfully solves classical problems. However we acknowledge that the performance compared
to the state of the art remains to be investigated and we refrain from a full discussion until such data has
been obtained.



6 [20] did not require the initial policy to be stable; however, a discounted cost was used and the initial policy was restricted
to give $\gamma^{-2} > \operatorname{eig}(A - bK)$ with $K$ the control gains, i.e. the policy had to give rise to a well defined value function.
7 n.b. for the evaluation we did not apply constraints K9t
Appendices



A Supplementary proofs 



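Lemma (Lemma 4 in subsubsection 3.3.1). Let $a, b, c$ be random variables with joint $P(a,b,c) = P(a)P(b|a)P(c|b,a)$
and $\mathcal{P}$ the set of distributions over $a$, then

$$P(a) \exp\left\{\int_b P(b|a) \log P(c=\hat{c}|b)\right\} \propto \operatorname*{argmin}_{q \in \mathcal{P}} \mathrm{KL}\left(q(a)P(b|a) \,\|\, P(a,b|c=\hat{c})\right) \tag{70}$$

and

$$-\log \int_a P(a) \exp\left\{\int_b P(b|a) \log P(c=\hat{c}|b)\right\} \doteq \min_{q \in \mathcal{P}} \mathrm{KL}\left(q(a)P(b|a) \,\|\, P(a,b|c=\hat{c})\right) . \tag{71}$$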



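Proof. We form the Lagrangian

\begin{align}
\mathcal{L} &= \mathrm{KL}\left(q(a)P(b|a) \,\|\, P(a,b|c=\hat{c})\right) + \lambda\left[\int_a q(a) - 1\right] \tag{72}\\
&\doteq \int_a \int_b q(a)P(b|a) \log \frac{q(a)P(b|a)}{P(a)P(b|a)P(c=\hat{c}|b)} + \lambda\left[\int_a q(a) - 1\right] \tag{73}\\
&= \int_a q(a) \log \frac{q(a)}{P(a)} - \int_a \int_b q(a)P(b|a) \log P(c=\hat{c}|b) + \lambda\left[\int_a q(a) - 1\right] , \tag{74}
\end{align}

where we use $\doteq$ to indicate equality up to an additive constant. Setting the partial derivatives w.r.t. $q(a)$
to $0$ gives

\begin{align}
0 &= \log \frac{q(a)}{P(a)} + 1 - \int_b P(b|a) \log P(c=\hat{c}|b) + \lambda \tag{75}\\
&= \log \frac{q(a)}{Z(\lambda)\,P(a) \exp\left\{\int_b P(b|a) \log P(c=\hat{c}|b)\right\}} , \tag{76}
\end{align}

where $Z$ is a function of the Lagrange multiplier. The result in (70) now directly follows and more specifically
the minimizer is

$$q^*(a) = \frac{P(a) \exp\left\{\int_b P(b|a) \log P(c=\hat{c}|b)\right\}}{\int_a P(a) \exp\left\{\int_b P(b|a) \log P(c=\hat{c}|b)\right\}} . \tag{77}$$

The result in (71) can now easily be obtained by substituting $q^*$ into the KL divergence.

□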
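The lemma can be checked numerically on a small discrete example. In the sketch below integrals become sums, the objective is the KL divergence with the constant log-evidence term dropped, and all probability tables are hypothetical values chosen for illustration.

```python
import math

def kl_objective(q, p_a, p_b_given_a, lik_b):
    """KL(q(a)P(b|a) || P(a,b|c)) without the additive constant log P(c)."""
    total = 0.0
    for a, q_a in enumerate(q):
        if q_a == 0.0:
            continue
        expected_log_lik = sum(p * math.log(l) for p, l in zip(p_b_given_a[a], lik_b))
        total += q_a * (math.log(q_a / p_a[a]) - expected_log_lik)
    return total

def minimiser(p_a, p_b_given_a, lik_b):
    """Normalised P(a) * exp(sum_b P(b|a) log P(c|b)) together with its normaliser."""
    unnorm = [p * math.exp(sum(pb * math.log(l) for pb, l in zip(row, lik_b)))
              for p, row in zip(p_a, p_b_given_a)]
    z = sum(unnorm)
    return [v / z for v in unnorm], z

# Hypothetical tables with two values of a and two values of b.
p_a = [0.5, 0.5]
p_b_given_a = [[0.9, 0.1], [0.2, 0.8]]
lik_b = [0.7, 0.1]   # likelihood of the observed value of c for each b
q_star, z = minimiser(p_a, p_b_given_a, lik_b)
best = kl_objective(q_star, p_a, p_b_given_a, lik_b)
```

Evaluating the objective at other distributions over a, for instance the prior itself, yields strictly larger values, and the minimum coincides with the negative log of the normaliser.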